Learning Theory
Supervised Learning
Objective
For an unknown \(c\in\mathcal{C}\), given
- \(x^{(1)},\ldots,x^{(m)}\) (“training data”), where each \(x^{(i)}\) is drawn i.i.d. from a distribution \(D\)
- \(c(x^{(1)}),\ldots,c(x^{(m)})\) (“labels”)
find a hypothesis \(h(x)\) such that
- \(h\in\mathcal{C}\)
- \(\forall i\; h(x^{(i)}) = c(x^{(i)})\)
Equivalently, define
\[ \mathrm{Cons}_{\mathcal{C}}(X) \;=\; \lbrace\, h\in\mathcal{C} : h(x^{(i)}) = c(x^{(i)}) \text{ for all } i \,\rbrace \]
and view learning as a relation/search problem: find \(h\in \mathrm{Cons}_{\mathcal{C}}(X)\).
(Topics listed in lecture: Supervised Learning, Occam’s Razor, Elimination Algorithm.)
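To make the search formulation concrete, here is a minimal consistency check in Python (my own sketch, not from the lecture); `h` is assumed to be any callable mapping an example to a 0/1 label.

```python
def is_consistent(h, examples, labels):
    """Return True iff hypothesis h agrees with the target label on every training example."""
    return all(h(x) == y for x, y in zip(examples, labels))

# Usage: examples as bit-vectors, h a candidate conjunction x1 AND x3
examples = [(1, 0, 1), (0, 0, 1), (1, 1, 0)]
labels   = [1, 0, 0]
h = lambda x: int(x[0] == 1 and x[2] == 1)
print(is_consistent(h, examples, labels))  # True: h is in Cons_C(X) for this sample
```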
Example: Databases
Let \(\mathcal{C}\) be “all databases” (i.e., all possible properties).
A naive algorithm:
- If we observe \(c(x^{(i)}) = 1\), add \(x^{(i)}\) to our database hypothesis \(h\).
Question: how does this \(h\) perform on unobserved examples?
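As a quick sketch (my own, not from the lecture): the naive algorithm simply memorizes the positive examples, so it is consistent with the training data but predicts 0 on everything it has never seen.

```python
def memorize(examples, labels):
    """Naive 'database' learner: store every positively labeled example."""
    database = {x for x, y in zip(examples, labels) if y == 1}
    # The returned hypothesis says 1 only for examples it has already seen labeled 1.
    return lambda x: int(x in database)

h = memorize([(1, 0), (0, 1), (1, 1)], [1, 0, 1])
print(h((1, 0)))  # 1 -- seen as a positive example
print(h((0, 0)))  # 0 -- never observed, so the hypothesis defaults to 0
```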
PAC Learning
We say \(\mathcal{C}\) is PAC-learnable if there exists an algorithm such that for every \(c\in\mathcal{C}\) and every distribution \(D\), given parameters \(n,\varepsilon,\delta\) it
- uses \(m = \mathrm{poly}(n, 1/\varepsilon, 1/\delta)\) labeled examples drawn i.i.d. from \(D\), and
- runs in \(\mathrm{poly}(n, 1/\varepsilon, 1/\delta)\) time, and
- with probability at least \(1-\delta\) (“Probably”) returns \(h\in\mathcal{C}\) such that \(\Pr_{x\sim D}[h(x)=c(x)] \ge 1-\varepsilon\) (“Approximately”).
Proof pattern (union bound over “small” hypotheses)
Define \(\hat h\) to be BAD if \(\Pr_{x\sim D}[\hat h(x)\ne c(x)] \ge \varepsilon\).
Key calculation: if \(\hat h\) is BAD, then the probability that it is still consistent with \(m\) i.i.d. samples is at most
\[ (1-\varepsilon)^m \le e^{-\varepsilon m}. \]
If there are at most \(2^{B(n,m)+1}\) “small” candidate hypotheses (bounded by some description-length term \(B(n,m)\)), then by a union bound the probability that any small BAD hypothesis remains consistent is at most
\[ 2^{B(n,m)+1}\,(1-\varepsilon)^m \le 2^{B(n,m)+1}\,e^{-\varepsilon m}. \]
Choosing
- \(k = 1/\ln 2\) (to convert between \(\ln\) and \(\log_2\)), and
- \(m\) large enough so that the above expression is \(\le \delta\)
gives the standard PAC-style sample bound.
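Filling in the algebra behind this choice (my own expansion of the step sketched above): with \(e^{-\varepsilon m}=2^{-k\varepsilon m}\) for \(k=1/\ln 2\), requiring the union-bound expression to be at most \(\delta\) gives
\[ 2^{B(n,m)+1}e^{-\varepsilon m} = 2^{B(n,m)+1-k\varepsilon m} \le \delta \;\iff\; m \ge \frac{1}{k\varepsilon}\Big(B(n,m)+1+\log_2\tfrac{1}{\delta}\Big), \]
which matches the conjunction bound quoted later with \(B(n,m)=2n\).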
Example: Survey / empirical validity
Suppose we survey 20 households chosen at random from the city, asking whether they own or rent and what their income is.
Consider the property “income > $300k or rents,” and suppose all 20 survey responses satisfy it.
Let \(A_i\) be the event that the \(i\)-th surveyed household satisfies the property.
Assuming the responses are independent,
\(P\big(\bigcap_{i=1}^{20} A_i\big)=\prod^{20}_{i=1}P(A_i)\), where each \(P(A_i)=\frac{\#\text{ households satisfying the property}}{\#\text{ total households}}\).
Suppose fewer than 90% of households satisfy the property.
Then this probability would be \(<0.9^{20}\approx0.12\).
Thus, having observed all 20 responses satisfy the property, we have \(\approx 88\%\) confidence that \(\geq 90\%\) of households satisfy it.
More Generally
We will consider domains corresponding to distributions \(D\) over sets of possible worlds. We say \(C\) is \((1-\varepsilon)\)-valid if \(P_D(\text{not } C)\leq \varepsilon\). If \(C\) is not \((1-\varepsilon)\)-valid and we independently draw 20 examples from \(D\), then \(P(\text{all }20\text{ satisfy }C) = P_D(C)^{20} < (1-\varepsilon)^{20}\).
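A small numeric check of this bound (my own snippet; the function name `validity_confidence` is just illustrative):

```python
def validity_confidence(epsilon, num_samples):
    """If C is NOT (1 - epsilon)-valid, then P(all samples satisfy C) < (1 - epsilon)**num_samples.
    Seeing every sample satisfy C therefore gives confidence > 1 - (1 - epsilon)**num_samples
    that C is (1 - epsilon)-valid."""
    return 1 - (1 - epsilon) ** num_samples

# Survey example: epsilon = 0.1 (i.e. "at least 90% satisfy"), 20 households, all satisfying the property.
print(round(0.9 ** 20, 3))                      # 0.122: chance of this outcome if fewer than 90% satisfy
print(round(validity_confidence(0.1, 20), 3))   # 0.878: confidence that >= 90% satisfy
```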
Learning Conjunctions
Elimination Algorithm
Initialize \(h(x)\) to contain all literals (so \(2n\) literals total).
For each example \(i\):
- If \(c(x^{(i)}) = 1\), remove from \(h\) every literal \(\ell\) such that \(\ell(x^{(i)})=0\).
Return \(h(x)\).
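A minimal Python rendering of the Elimination Algorithm (my own sketch; examples are assumed to be 0/1 tuples, and a literal is represented as a pair `(j, v)` meaning \(x_j = v\)):

```python
def eliminate(examples, labels):
    """Elimination algorithm for learning a conjunction over n boolean variables.
    A literal (j, 1) means x_j, and (j, 0) means NOT x_j."""
    n = len(examples[0])
    # Start with all 2n literals: x_1, NOT x_1, ..., x_n, NOT x_n.
    h = {(j, v) for j in range(n) for v in (0, 1)}
    for x, y in zip(examples, labels):
        if y == 1:
            # On a positive example, drop every literal the example falsifies.
            h = {(j, v) for (j, v) in h if x[j] == v}
    return h

def predict(h, x):
    """Evaluate the conjunction h on example x."""
    return int(all(x[j] == v for (j, v) in h))

# Labels generated by the target concept c(x) = x_1 AND NOT x_3.
examples = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
labels   = [1, 1, 0, 0]
h = eliminate(examples, labels)
print(all(predict(h, x) == y for x, y in zip(examples, labels)))  # True: h is consistent
```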
Theorem
Elimination runs in \(O(mn)\) time and solves \(\mathrm{Cons}_{\text{conj}}\) (when \(\mathcal{C}\) is the class of all conjunctions).
Proof idea. Initially \(h(x)\) contains all \(2n\) literals. If a literal is false on some positive example, it cannot appear in the target conjunction (otherwise \(c\) would output 0 on that example), so it is safe to remove. Conversely, on an example with \(c(x^{(i)})=1\) every literal of \(c\) is true, so literals of \(c\) are never removed. Hence \(h\) always contains all of \(c\)'s literals (so \(h(x)=1\) implies \(c(x)=1\), and \(h\) is correct on negative examples), and by construction every literal kept in \(h\) is true on every positive example, so \(h\) is consistent with the sample.
A corresponding sample bound (as written in lecture) was of the form
\[ m = \frac{1}{k\varepsilon}\Big(2n + \log_2\tfrac{1}{\delta} + 1\Big), \qquad k=\frac{1}{\ln 2}. \]
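Plugging illustrative numbers (not from the lecture) into this bound:

```python
import math

def conjunction_sample_bound(n, epsilon, delta):
    """Sample size from the bound above: m = (1/(k*eps)) * (2n + log2(1/delta) + 1), with k = 1/ln 2."""
    k = 1 / math.log(2)
    return math.ceil((2 * n + math.log2(1 / delta) + 1) / (k * epsilon))

# e.g. n = 100 variables, 10% error, 95% confidence
print(conjunction_sample_bound(100, 0.1, 0.05))  # 1424 examples
```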
Reduction from \(\mathrm{Cons}\) to PAC
Lemma
If there is a PAC-learning algorithm for a class \(\mathcal{C}\), then there is a randomized polynomial-time algorithm for solving the corresponding consistency/search problem \(\mathrm{Cons}_{\mathcal{C}}\) (for appropriately bounded instances).
Proof
Given a consistency instance with examples \(X=\lbrace x^{(1)},\ldots,x^{(m)}\rbrace\), define the distribution \(D\) to be uniform over \(X\), i.e., \(D(x^{(i)})=\tfrac{1}{m}\) for each \(i\).
Run the PAC learner on \((\mathcal{C}, n, \varepsilon=\tfrac{1}{2m}, \delta=\tfrac12)\) using samples drawn i.i.d. from this \(D\).
With probability at least \(1-\delta=\tfrac12\), the learner outputs \(h\) such that
\[ \Pr_{x\sim D}[h(x)\ne c(x)] \le \varepsilon = \tfrac{1}{2m}. \]
But under a uniform distribution over exactly \(m\) points, the error rate is an integer multiple of \(1/m\). If it is at most \(1/(2m) < 1/m\), then it must be \(0\), hence \(h\) agrees with \(c\) on every \(x^{(i)}\), i.e. \(h\in\mathrm{Cons}_{\mathcal{C}}(X)\). (Repeating the run and checking consistency directly boosts the success probability as high as desired.)
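A code sketch of this reduction (the PAC learner `pac_learn` is a hypothetical black box with the sample/parameter interface shown, not something specified in the lecture):

```python
import random

def solve_cons_via_pac(pac_learn, examples, labels, n):
    """Reduce the consistency problem to PAC learning: feed the learner i.i.d. samples
    from the uniform distribution over the given examples, then check consistency."""
    m = len(examples)

    def sample():
        # Uniform distribution D over the m training points, with their known labels.
        i = random.randrange(m)
        return examples[i], labels[i]

    # epsilon = 1/(2m) forces zero error on D; delta = 1/2 succeeds half the time.
    h = pac_learn(sample, n=n, epsilon=1 / (2 * m), delta=0.5)
    if all(h(x) == y for x, y in zip(examples, labels)):
        return h   # h is in Cons_C(X)
    return None    # this run failed; repeat to boost the success probability
```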
Consequence
Reduction from \(\mathrm{Cons}\) to PAC: if \(\mathrm{Cons}_{\mathcal C}\) is NP-hard, then PAC-learning \(\mathcal C\) is at least as hard, so it cannot be done in polynomial time unless NP-hard problems have randomized polynomial-time algorithms.
Can PAC but not Consis?
The reduction is phrased as a discrete-gap argument.
- Run PAC-learning on \((\mathcal C,n,\varepsilon=\tfrac{1}{2m},\delta=\tfrac12)\).
- Given sample set \(X=\lbrace x^{(1)},\ldots,x^{(m)}\rbrace\), define \(D'\) to be the uniform distribution over \(X\), i.e., \(D'(x^{(i)})=\tfrac{1}{m}\).
- Question written: Can \(h(x)\) satisfy PAC-learning but not satisfy \(\mathrm{Consis}^{B(n)}_{\mathcal C}\)?
If there exists an index \(i\) with \(h(x^{(i)})\ne c(x^{(i)})\), then under \(D'\) we have
\[ \Pr_{x\sim D'}[h(x)\ne c(x)] \ge \tfrac{1}{m}, \]
so the agreement probability is
\[ \Pr_{x\sim D'}[h(x)=c(x)] \le 1-\tfrac{1}{m} \]
(i.e., the error is at least \(1/m\)). Therefore, achieving error \(\le \varepsilon=\tfrac{1}{2m}\) forces zero error on \(D'\), hence \(h\) is consistent with all \(m\) examples.